Fast and Simple Character Classes and Bounded Gaps Pattern Matching, with Applications to Protein Searching

Authors

  • Gonzalo Navarro
  • Mathieu Raffinot
Abstract

The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded-size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching. For instance, one PROSITE protein site is associated with the CBG [RK] - x(2,3) - [DE] - x(2,3) - Y, where the brackets match any of the letters inside and x(2,3) matches a gap of length between 2 and 3. Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, an RE is more sophisticated than a CBG, and searching for it with an RE pattern matching algorithm complicates the search and makes it slow. For this reason, we design in this article two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques. The first one looks exactly once at each text character. The second one does not need to consider all the text characters, and hence it is usually faster than the first one, but in bad cases it may have to read the same text character more than once. We then propose a criterion, based on the form of the CBG, to choose a priori the faster of the two. We also show how to search while permitting a few mistakes in the occurrences. We performed many practical experiments using the PROSITE database, and all of them show that our algorithms are the fastest in virtually all cases.
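The baseline that the abstract argues against, converting a CBG into a full regular expression, can be sketched as follows. This is only an illustration of that conversion, not the authors' algorithms; the function names `cbg_to_regex` and `find_cbg` are hypothetical, and the sketch assumes a simplified PROSITE-like syntax with just three token kinds (a single letter, a bracketed class, and a gap written `x`, `x(n)` or `x(n,m)`).

```python
import re


def cbg_to_regex(cbg: str) -> str:
    """Translate a simplified PROSITE-style CBG into a Python regex string.

    Assumed token kinds (hypothetical, simplified syntax):
      Y        a single uppercase letter
      [RK]     a class of uppercase letters
      x(2,3)   a gap of 2 to 3 arbitrary characters; x and x(n) are also accepted
    """
    parts = []
    for token in cbg.replace(" ", "").split("-"):
        gap = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", token)
        if gap:
            lo = gap.group(1) or "1"   # bare x means exactly one character
            hi = gap.group(2) or lo    # x(n) means exactly n characters
            parts.append(".{%s,%s}" % (lo, hi))
        elif re.fullmatch(r"\[[A-Z]+\]", token) or re.fullmatch(r"[A-Z]", token):
            parts.append(token)        # classes and letters are already valid regex
        else:
            raise ValueError("unsupported CBG token: %r" % token)
    return "".join(parts)


def find_cbg(cbg: str, text: str):
    """Return (start, matched_substring) for every start position that matches.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    pattern = re.compile("(?=(" + cbg_to_regex(cbg) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    site = "[RK] - x(2,3) - [DE] - x(2,3) - Y"
    print(cbg_to_regex(site))              # [RK].{2,3}[DE].{2,3}Y
    print(find_cbg(site, "AARAADAAYAA"))   # [(2, 'RAADAAY')]
```

A general-purpose regex engine treats the pattern as an arbitrary RE, which is the overhead the abstract says the two specialized CBG algorithms avoid.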

Similar resources

A Dynamically Reconfigurable FPGA-Based Pattern Matching Hardware for Subclasses of Regular Expressions

In this paper, we propose a novel architecture for large-scale regular expression matching, called dynamically reconfigurable bit-parallel NFA architecture (Dynamic BP-NFA), which allows dynamic loading of regular expressions on-the-fly as well as efficient pattern matching for fast data streams. This is the first dynamically reconfigurable hardware with guaranteed performance for the class of ex...

Modulated string searching

In his 1987 paper entitled Generalized String Matching Abrahamson introduced the concept of pattern matching with character classes and provided the first efficient algorithm to solve this problem. The best known solution to date is due to Linhart and Shamir (2009). Another broad yet comparatively less intensively studied class of string matching problems is numerical string searching, such as ...

Simple deterministic wildcard matching

We present a simple and fast deterministic solution to the string matching with don’t cares problem. The task is to determine all positions in a text where a pattern occurs, allowing both pattern and text to contain single character wildcards. Our algorithm takes O(n log m) time for a text of length n and a pattern of length m and in our view the algorithm is conceptually simpler than previous a...

A New RSTB Invariant Image Template Matching Based on Log-Spectrum and Modified ICA

Template matching is a widely used technique in many image processing and machine vision applications. In this paper we propose a new, fast and reliable template matching algorithm which is invariant to Rotation, Scale, Translation and Brightness (RSTB) changes. For this purpose, we adopt the idea of ring projection transform (RPT) of image. In the proposed algorithm, two novel s...

New and Efficient Recursive-based String Matching Algorithm (RSMA-FLFC)

The need for simple and efficient string matching algorithms is essential for many applications, and especially for database query. In this paper, two major algorithms are proposed, namely first least frequency character algorithm (FLFC) and recursive-based string matching algorithm (RSMA). FLFC is considered as an enhanced version of scan for lowest frequency character SLFC proposed by Horspoo...

Journal:
  • Journal of computational biology : a journal of computational molecular cell biology

Volume 10, Issue 6

Pages: -

Publication date: 2003